library(tm)
library(tidytext)
library(tidyverse)
library(DT)
library(wordcloud2)
library(word2vec)
library(ggplot2)
library(plotly)
library(uwot)

Introduction

In our daily life, there are moments of happiness that make us feel fulfilled and energetic. In this project, we explore these moments from several angles using Natural Language Processing (NLP) and produce a data story about them.

Step 0 Data Process

The data we use originates from HappyDB, a database that records a corpus of 100,000 crowd-sourced happy moments. Before the analysis, we first need to process the data into forms better suited for analysis. Fortunately, the text-processing step has already been run and its output saved, so we can load the processed data set directly.

Step 1 A Glance of Data

First, let us briefly look at the data set's structure and the variables it contains.

dat <- read.csv('../output/processed_moments.csv')
head(dat,5)
##    hmid  wid reflection_period
## 1 27673 2053               24h
## 2 27674    2               24h
## 3 27675 1936               24h
## 4 27676  206               24h
## 5 27677 6227               24h
##                                                                                                                          original_hm
## 1                                                      I went on a successful date with someone I felt sympathy and connection with.
## 2                                                                           I was happy when my son got 90% marks in his examination
## 3                                                                                       I went to the gym this morning and did yoga.
## 4 We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.
## 5                                                             I went with grandchildren to butterfly display at Crohn Conservatory\n
##                                                                                                                           cleaned_hm
## 1                                                      I went on a successful date with someone I felt sympathy and connection with.
## 2                                                                           I was happy when my son got 90% marks in his examination
## 3                                                                                       I went to the gym this morning and did yoga.
## 4 We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.
## 5                                                             I went with grandchildren to butterfly display at Crohn Conservatory\n
##   modified num_sentence ground_truth_category predicted_category id
## 1     TRUE            1                  <NA>          affection  1
## 2     TRUE            1                  <NA>          affection  2
## 3     TRUE            1                  <NA>           exercise  3
## 4     TRUE            2               bonding            bonding  4
## 5     TRUE            1                  <NA>          affection  5
##                                                 text
## 1               connected date successfully sympathy
## 2                              examination marks son
## 3                                   gym morning yoga
## 4        evening flaky friend hang talked understood
## 5 butterfly conservatory crohn display grandchildren
colnames(dat)
##  [1] "hmid"                  "wid"                   "reflection_period"    
##  [4] "original_hm"           "cleaned_hm"            "modified"             
##  [7] "num_sentence"          "ground_truth_category" "predicted_category"   
## [10] "id"                    "text"

We can see that the variable predicted_category contains the predicted category for each moment. Below, we carry out some analysis based on this variable.

table(dat$predicted_category)
## 
##      achievement        affection          bonding enjoy_the_moment 
##            33897            34164            10726            11109 
##         exercise          leisure           nature 
##             1196             7457             1843

From the table, we observe that achievement and affection account for the largest fractions among the categories of recorded moments. This suggests that these two kinds of feelings bring people the most happiness.
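The same counts can also be read as shares of the whole corpus; a small sketch, reusing the dat object loaded above:

```r
# Convert the category counts into proportions of all moments
cat_prop <- sort(prop.table(table(dat$predicted_category)),
                 decreasing = TRUE)
round(cat_prop, 3)
```

This makes the dominance of achievement and affection explicit without having to compare raw counts by eye.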

Step 2 Generating the Basic Word Cloud

Above, we took a brief glance at the data. Here, we want to further explore the sentiment of the recorded sentences and extract the feelings that make people happy. First, we count the occurrences of the different words.

dat_word <- dat %>%
  unnest_tokens(word, text)
dat_word_count <- dat_word %>%
  count(word, sort = TRUE)
head(dat_word_count,5)
##      word     n
## 1  friend 10892
## 2     day  9930
## 3    time  9692
## 4  family  4692
## 5 watched  4385

After getting a bag of counts of different words, we use it to generate our first word cloud.

wordcloud2(dat_word_count[which(dat_word_count$n>300),])

Above we have generated the first word cloud, in which words like friend, day, and home stand out. Although we can draw a lot of information from this graph, it has some clear drawbacks. For instance, the word cloud contains too many words, which makes it look overwhelming. To reduce its complexity, we keep only the words that appear more than 2,000 times in the corpus.

dat_word_count_l2k <- dat_word_count[which(dat_word_count$n>2000),]
wordcloud2(dat_word_count_l2k)

By keeping only the most frequent words, we obtain a more readable word cloud containing only the most important terms.
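If the cloud still needs tuning, wordcloud2 also exposes a few display options; a small sketch (the size and shape values here are arbitrary illustrative choices, not part of the original analysis):

```r
# Shrink the overall font scale and set the cloud's outline shape;
# both arguments are standard wordcloud2 options
wordcloud2(dat_word_count_l2k, size = 0.6, shape = "circle")
```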

Step 3 Vectorization

By now, we have processed the data, briefly examined its structure, and generated basic word clouds showing the most frequent words. What we are interested in now is how to find connections between words. Are there similarities among them? And how could we infer a person's probable interests given some known triggers of his/her happiness? Here, we use the package word2vec to carry out this work.

dat_vec <- dat
set.seed(4243)
model1 <- word2vec(x = dat_vec$text, dim = 15, iter = 20)

Now we have model1 trained for the embedding. Then we can type in words we are interested in and find their nearest neighbors. For instance, I am quite interested in games and basketball, and I wonder what similar things could make me happy as well.

embed <- predict(model1, c("game", "basketball"), type = "nearest", top_n = 10)
embed
## $game
##    term1       term2 similarity rank
## 1   game   franchise  0.9045113    1
## 2   game      league  0.9023066    2
## 3   game      hockey  0.9016669    3
## 4   game    football  0.8996831    4
## 5   game        fifa  0.8943493    5
## 6   game          gt  0.8917541    6
## 7   game    baseball  0.8864343    7
## 8   game hearthstone  0.8861898    8
## 9   game          ii  0.8798506    9
## 10  game       gamer  0.8745061   10
## 
## $basketball
##         term1        term2 similarity rank
## 1  basketball     baseball  0.9707909    1
## 2  basketball         cubs  0.9377135    2
## 3  basketball         beat  0.9207459    3
## 4  basketball       hockey  0.9204223    4
## 5  basketball     defeated  0.9136161    5
## 6  basketball championship  0.9119426    6
## 7  basketball         ping  0.9061303    7
## 8  basketball        guild  0.9041461    8
## 9  basketball    champions  0.8999110    9
## 10 basketball     football  0.8979672   10

The output above lists 20 other things that could probably make me feel happy, given that I am a game fan and a basketball fan. I might try some of them myself.

Then, we can draw an interactive plot to help us map the words we are interested in.

# retrain an embedding model for the whole vocabulary
model2 <- word2vec(x = dat_vec$text, dim = 15, iter = 20)
# extract the embedding matrix (one row per vocabulary word)
embed_plot <- as.matrix(model2)
# project the embeddings down to two dimensions with UMAP
viz <- umap(embed_plot, n_neighbors = 15, n_threads = 2)
df_tmp <- data.frame(word = rownames(viz), 
                     x = viz[, 1], y = viz[, 2],
                     stringsAsFactors = FALSE)
plot_ly(df_tmp, x = ~x, y = ~y, type = "scatter",
        mode = 'text', text = ~word)

The graph looks quite messy, but we can hover over the points to see the corresponding words.
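One hedged way to tidy the view, assuming the same df_tmp data frame, is to plot markers only and reveal each word in the tooltip on hover instead of printing every label at once:

```r
# Show points only; the word appears in the hover tooltip
plot_ly(df_tmp, x = ~x, y = ~y, type = "scatter",
        mode = "markers", text = ~word, hoverinfo = "text")
```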

We all know the famous analogy: \[king - man + woman = queen\] Similarly, we want to know the most probable triggers of happiness for a person who enjoys spending time with friends and playing games but does not like basketball. What we need to do is compute the corresponding position in the embedding space and find the nearest words around it.

tmp <- predict(model2, newdata = 
                 c('friend','game','basketball'),
               type = 'embedding')
tmp_form <- tmp['friend',] + tmp['game',] - tmp['basketball',]
predict(model2, newdata = tmp_form, 
        type = "nearest", top_n = 10)
##          term similarity rank
## 1        iave  0.9998181    1
## 2         hey  0.9986129    2
## 3    internet  0.9983231    3
## 4     parents  0.9981050    4
## 5        mine  0.9974617    5
## 6        iaam  0.9971441    6
## 7       human  0.9965062    7
## 8  friendship  0.9963590    8
## 9        hard  0.9950250    9
## 10    imagine  0.9940613   10

Summary

Above, we have conducted several analyses based on HappyDB. However, there is still more work that could be done. For instance, we could reduce the model's complexity by limiting the minimum occurrence of words, and then build a neural network to determine to whom a given input trigger most likely belongs.
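As a first step in that direction, the vocabulary can already be pruned when training the embedding itself; a minimal sketch, reusing the dat_vec object from Step 3 and word2vec's min_count argument (words occurring fewer times than the threshold are dropped; the threshold of 10 is an arbitrary illustrative choice):

```r
# Retrain with rare words removed to shrink the vocabulary
set.seed(4243)
model3 <- word2vec(x = dat_vec$text, dim = 15, iter = 20, min_count = 10)
dim(as.matrix(model3))  # fewer vocabulary rows than the full model
```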